Abstract

We derive an optimal strategy in the popular Deal or No Deal game show. Q-learning 
quantifies the continuation value inherent in sequential decision making and we use 
this to analyze contestants' risky choices. Given their choices and optimal strategy,
we invert to find implied bounds on their levels of risk aversion. In risky decision 
making, previous empirical evidence has suggested that past outcomes affect future 
choices and that contestants have time-varying risk aversion. We demonstrate that 
the strategies of two players (Suzanne and Frank) from the European version of the 
game are consistent with constant risk aversion levels except for their last risk-seeking 
choice. 

1 Introduction 

Ever since the introduction of the popular television show Deal or No Deal, many authors
have analyzed aspects of the game. Deal or No Deal provides an 'experiment' with
large stakes and a relatively simple probabilistic structure. Using Q-learning (Watkins,
1989, Whittle, 1982, Putterman, 1984, Polson and Sorensen, 2011) we address the question
of optimal strategy. Our solution technique provides a dynamic maximum expected
utility solution (Ramsey, 1926, de Finetti, 1937, von Neumann and Morgenstern, 1944).
Q-learning is a popular reinforcement learning technique for calculating continuation values
for sequential decision making.

In Deal or No Deal contestants are presented with a simple decision of whether to take
a Banker's offer or to continue in the game. Hence the value realized from deciding to
decline the banker's offer ('No Deal') has a continuation value, a stylized fact of sequential
decision making. Previous empirical evidence has suggested that past outcomes affect
future choices and that contestants have time-varying risk aversion. One key feature of
continuation values is that they can appear to lead to the same effect, namely current
actions appearing to exhibit time-varying risk aversion, when in fact it is no more than an
optimal action to exercise the continuation value of the game. To illustrate these effects,
we consider a simple scenario with logarithmic utility. This also illustrates a theoretical
rule of thumb in Deal or No Deal: one should generally continue as long as there are two
big prizes left.

We analyze data from two players, Frank and Suzanne, from the European version of 
the game show (Post et al, 2008). Tables 1 and 2 provide the contestants' choices. For
example, in round seven, after several unlucky picks, Frank opened the briefcase with the 
last remaining large prize (€500,000) and saw his expected prize tumble from €102,006 
to €2,508. The banker offered him €2,400 but Frank rejected the offer and continued to 
play. He finally ended up with a briefcase with only €10. In round nine, he even rejected 
a certain €6,000 in favor of a 50/50 gamble of €10 or €10,000, clearly exhibiting
risk-seeking behavior.

In contrast, Suzanne was a "lucky" player. In round nine she faced a 50/50 gamble of
€100,000 or €150,000 (two of the three largest prizes in the German edition). While she 
was hesitant in the earlier rounds, she rejected the banker's offer of €125,000 - the expected 
payoff - and finally won the €150,000 prize. Our analysis will track their choices and infer 
bounds on their levels of risk aversion at each stage. We provide a separate analysis of the 
last stage of the game, as both players exhibit risk seeking behavior here. Other authors 
claim declining risk aversion of individuals after earlier expectations have been shattered 
by unfavorable outcomes. Our approach shows that the continuation value is in fact high 
and therefore risk aversion of contestants need not decline in the aforementioned case. 

The rest of the paper is outlined as follows. In Section 2 we discuss the Q-learning
technique and how it is applied to the Deal or No Deal game, including log and power
utility examples. Section 3 solves for optimal strategy using Q-learning. Given an optimal
strategy and the contestants' empirical choices, we can then infer bounds on their risk
aversion. We also analyze their terminal risk-seeking choices. Finally, Section 4 concludes.

2 Q-Learning and Deal-No Deal



2.1 Deal or No Deal 

In the game Deal or No Deal, players are presented with a choice of briefcases which hold 
distinct monetary prize values. Initially, players are asked to select one case, which will be 
referred to as 'their case' and remain unopened. Then, they proceed by choosing, and thus 
eliminating a predetermined number of cases each round. The values of these eliminated 
cases are shown to the player. At the end of each set of case eliminations, an entity referred 
to as 'The Banker' then presents the player with a monetary offer in exchange for their 
case. At this point, the player is given two options: either take the Banker's offer (i.e. 
Deal) or continue eliminating cases (i.e. No Deal). The game continues until an offer is
accepted or there is only one case remaining. 

From this setup, we can decompose the game into a set of states. Let the set $\mathcal{V}$ contain
all the possible prize values in the initial suitcases. Let the set $S$ consist of all possible
combinations of prize values from the given set of possible prize values $\mathcal{V}$. Let $s \in S$ be
one of the sets of possible remaining suitcase prize values. At each state $s$, the player is
given a set of possible actions $A$, defined as follows:

$$A = \{0, 1\} = \{\text{'Deal'}, \text{'No Deal'}\}$$

After performing an action $a \in A$, the player then experiences a transition from state to
state. We define the payoff to the player as a state-action map onto the space of real
numbers: $Q : S \times A \to \mathbb{R}$. The goal is simply to maximize the discounted resulting value
of this mapping. Q-learning allows us to maximize utility over a set of possible decisions,
providing a map of the optimal path of actions.

2.2 Optimal Strategy: Q-Learning

In this subsection, we describe the basic ingredients of Q-learning. First, we let the agent's
immediate utility (realized when taking the 'Deal') be denoted by $u(s_t, a_t)$. The Bellman
principle of optimality states that the optimal solution path $a^*(s)$ is the solution to the
Bellman equation defined by a value function $V(s)$ that satisfies

$$V(s) = \max_{a} \Big\{ u(s, a) + \sum_{s'} p(s' \mid s, a)\, V(s') \Big\}$$

Here $p(s' \mid s, a)$ is the transition matrix across states given action $a$. In Deal or No
Deal, actions cannot affect the state's evolution, so we write $p(s' \mid s)$. Rather than directly
computing the value function, we instead calculate the matrix of Q-values. They are defined
as the total expected utility gained by choosing a current action $a$ and following the optimal
path thereafter.

An optimal policy must satisfy Bellman's principle of optimality: that an optimal path 
has the property that whatever the initial conditions and control variables (choices) over 
some initial period, the control (or decision variables) chosen over the remaining period 
must be optimal for the remaining problem, with the state resulting from the early decisions 
taken to be the initial condition. 

To solve for this, we need a transition matrix for probabilities, a utility function and
a banker's valuation function. To fix notation, let $s \in S$ denote the current state of the
system and $a \in A$ an action. Define the Q-value, $Q_t(s, a)$, at time $t$ by the value of using
action $a$ today and then proceeding optimally in the future. The Bellman equation for
Q-values becomes:

$$Q_t(s, a) = u(s, a) + \sum_{s^* \in S^*} P(s^* \mid s, a)\, \max_{a} Q_{t+1}(s^*, a)$$

where $S^* \subset S$ is the set of all possible next period states given the current action $a$, namely
$S^* = \{ s^* : s^* \subset s,\ |s^*| = |s| - 1 \}$. The value function and optimal action are then
simply given by:

$$V(s) = \max_{a} Q(s, a) \quad \text{and} \quad a^*(s) = \arg\max_{a} Q(s, a)$$

In our discrete setting, we can directly find the Q-values without resorting to the simple
stochastic approximation algorithms given in Watkins (1989) and Watkins and Dayan
(1992). We now define the ingredients to solve the problem:

Transition Matrix. Since there is equal probability that a player chooses any of the
existing prize values, we define the transition probabilities as follows:

$$P(s^* \mid s, a = 1) = \frac{1}{|S^*|} = \frac{1}{|s|}$$

For example, when you have three prizes left, with $s$ the current state,

$$S^* = \{\text{all subsets of two prizes}\} \quad \text{and} \quad P(s^* \mid s, a = 1) = \frac{1}{3}$$

where the transition matrix is uniform to the next state.

There is no continuation value for taking the Deal. If action $a = 0$ is chosen, then the
value realized by the player is either the utility of the offer presented by the banker
or the utility of the value of the player's case (if no other cases remain).

Banker's Function $B(s)$. There are a number of different choices for modeling the banker's
function. In the live TV show one only sees the current banker offer and, of course,
to address the issue of optimal policy and the continuation value we need to know
what the banker would offer in future states of the world. One popular choice is expected
value: let $v \in s$ be a possible prize, then

$$B(s) = \bar{s} = \frac{1}{|s|} \sum_{v \in s} v$$

where $s$ is the set of the remaining prize values. Another choice is from the on-line
version of the game where the website (www.nbc.com/DealOrNoDeal) uses the
following criteria: let big and small denote the biggest and smallest prizes left on the
board, respectively. Then, the banker offer is given by

• With 3 prizes left: $B(s) = 0.305 \cdot \text{big} + 0.5 \cdot \text{small}$

• With 2 prizes left: $B(s) = 0.355 \cdot \text{big} + 0.5 \cdot \text{small}$.

This is not the case in the TV show as the Banker has some discretion on the offer.
Empirically, it almost strictly holds that $B(s) < \bar{s}$, where $\bar{s}$ is the expected value of the
remaining prizes.

Utility. The utility of the next state depends on the contestant's value for money and the
bidding function $B(s)$ of the banker. For example, in the case of CRRA power utility

$$u(B(s)) = \frac{B(s)^{1-\gamma} - 1}{1 - \gamma}$$

with log-utility $u(B(s)) = \ln(B(s))$ a special case. A more flexible choice is the
exponential-power utility function

$$u(B(s)) = \alpha^{-1}\left(1 - \exp\left\{-\alpha\,(W + B(s))^{1-\beta}\right\}\right)$$

where $W$ is current wealth. For the purpose of our analysis, we focus on the natural
logarithm and CRRA cases.
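The three ingredients above (uniform transitions, a banker's function, and a utility) are enough to compute Q-values by direct backward induction. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `expected_value_banker`, `log_utility`, and `q_values` are illustrative, a state is represented as a tuple of remaining prizes, and declining the final offer with two prizes left is modeled as a 50/50 draw between them. The demonstration uses the prizes {750, 500, 25} with the expected-value banker and log-utility.

```python
import math
from itertools import combinations


def expected_value_banker(prizes):
    """Banker offers the mean of the remaining prizes (one modeling choice)."""
    return sum(prizes) / len(prizes)


def log_utility(x):
    """Log-utility, the gamma -> 1 special case of CRRA power utility."""
    return math.log(x)


def q_values(prizes, banker, utility):
    """Return (Q(s, a=0), Q(s, a=1)) for the state `prizes`.

    a = 0 is 'Deal': the utility of the banker's offer, no continuation.
    a = 1 is 'No Deal': uniform average over next states of the best Q-value.
    """
    q_deal = utility(banker(prizes))
    if len(prizes) == 2:
        # Declining the last offer: the player's own case is one of the two.
        q_no_deal = sum(utility(v) for v in prizes) / 2
        return q_deal, q_no_deal
    # Every subset with one prize removed is an equally likely next state.
    next_states = list(combinations(prizes, len(prizes) - 1))
    q_no_deal = sum(
        max(q_values(s, banker, utility)) for s in next_states
    ) / len(next_states)
    return q_deal, q_no_deal


q_deal, q_no_deal = q_values((750, 500, 25), expected_value_banker, log_utility)
print(f"Deal: {q_deal:.3f}, No Deal: {q_no_deal:.3f}")
# Deal: 6.052, No Deal: 5.989
```

Swapping in a different `banker` or `utility` callable is all that is needed to repeat the calculation under other assumptions.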
2.3 Illustrative Example: Log-Utility

To show that the continuation value can be large we consider an example where there
are three prizes left including two large ones, $s = \{750, 500, 25\}$. Let us take an example
where the contestant is risk averse with log-utility: $u(x) = \ln x$. This is equivalent to the
well-known Kelly (1956) criterion. The contestant would be indifferent to a coin toss that
doubled or halved their wealth.

For a base case analysis, suppose that the Banker's offers are determined by the expected
value of the prizes left in the set $s$. With log-utility this will look like a good deal in a
one-shot version of the game. For this example, the utility of the offer is

$$u(B(s = \{750, 500, 25\})) = \ln(1275/3) = 6.052$$

Taking the deal leads to a utility $Q_t(s, a = 0) = 6.052$.

However, we have to compare this to the continuation problem ('No Deal'). The set of
future possible states is $S^* = \{s_1^*, s_2^*, s_3^*\}$ where

$$s_1^* = \{750, 500\}, \quad s_2^* = \{750, 25\}, \quad s_3^* = \{500, 25\}$$

As the banker offers the expected value, if the contestant picks 'No Deal' we will have offers
of 625, 387.5, and 262.5, respectively. This gives the following Q-value calculation:

$$\begin{aligned}
Q_t(s, a = 1) &= \sum_{s^* \in S^*} P(s^* \mid s, a = 1)\, \max_{a} Q_{t+1}(s^*, a) \\
&= \frac{1}{3}\left(\ln(625) + \ln(387.5) + \ln(262.5)\right) = 5.989
\end{aligned}$$

with immediate utility given by $u(s, a) = 0$.

Therefore, as $Q_t(s, a = 1) = 5.989 < 6.052 = Q_t(s, a = 0)$ the optimal action for the
player at time $t$ is $a^* = 0$, 'Deal'. We see that the continuation value was not large enough
to overcome the generous (expected value) offer by the banker.

This analysis also illustrates the rule-of-thumb that most players should proceed as long 
as there are two large prizes left as the continuation value is high. 

2.4 Sensitivity Analysis - Different Banker's Function 

The magnitude of the continuation value is related to the Banker's bidding function. We
now use the Banker's bidding function provided by the on-line game, defined by:

$$B(s) = 0.355 \cdot \text{big} + 0.5 \cdot \text{small} \quad \text{(with two prizes remaining)}$$

Hence, we can evaluate the banker's function at each of the $s^*$ next period states:

$$\begin{aligned}
B(s_1^* = \{750, 500\}) &= 516.25 \\
B(s_2^* = \{750, 25\}) &= 278.75 \\
B(s_3^* = \{500, 25\}) &= 190
\end{aligned}$$

Using these, let us consider the optimal action with 2 prize values left for the player. To
do so, we calculate the following Q-values:

$$\begin{aligned}
Q_{t+1}(s_1^*, a = 1) &= \tfrac{1}{2}\{\ln(750) + \ln(500)\} = 6.417 \\
Q_{t+1}(s_1^*, a = 0) &= \ln(516.25) = 6.246
\end{aligned}$$

Since $Q_{t+1}(s_1^*, a = 1) > Q_{t+1}(s_1^*, a = 0)$, the future optimal policy is 'No Deal' under $s_1^*$.

$$\begin{aligned}
Q_{t+1}(s_2^*, a = 1) &= \tfrac{1}{2}\{\ln(750) + \ln(25)\} = 4.9194 \\
Q_{t+1}(s_2^*, a = 0) &= \ln(278.75) = 5.63
\end{aligned}$$

Since $Q_{t+1}(s_2^*, a = 1) < Q_{t+1}(s_2^*, a = 0)$, the future optimal policy is 'Deal' under $s_2^*$.

$$\begin{aligned}
Q_{t+1}(s_3^*, a = 1) &= \tfrac{1}{2}\{\ln(500) + \ln(25)\} = 4.716 \\
Q_{t+1}(s_3^*, a = 0) &= \ln(190) = 5.247
\end{aligned}$$

Since $Q_{t+1}(s_3^*, a = 1) < Q_{t+1}(s_3^*, a = 0)$, the future optimal policy is 'Deal' under $s_3^*$.
Now, solving for Q-values at the previous step gives

$$\begin{aligned}
Q_t(s, a = 1) &= \sum_{s^* \in S^*} P(s^* \mid s, a = 1)\, \max_{a} Q_{t+1}(s^*, a) \\
&= \frac{1}{3}\left(6.417 + 5.63 + 5.247\right) = 5.764
\end{aligned}$$

with a monetary equivalent of $\exp(5.764) = 318.62$. This is the continuation value (or 'No
Deal' decision value) at the current time period $t$. We can compare this with the 'Deal'
decision value:

$$\begin{aligned}
Q_t(s, a = 0) &= u(B(s = \{750, 500, 25\})) \\
&= \ln(0.305 \cdot 750 + 0.5 \cdot 25) \\
&= \ln(241.25) = 5.48
\end{aligned}$$

The Banker then offers the contestant a monetary value of 241.25. Now, comparing the
Q-values of the decisions gives the optimal action of $a^* = 1$, 'No Deal', as

$$Q_t(s, a = 1) = 5.764 > 5.48 = Q_t(s, a = 0)$$

We get this result because the continuation value is large. Essentially we are considering
the difference between \$241 compared to \$319, or roughly a 32% premium.
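The comparison above can be retraced numerically. The snippet below is a small sketch, assuming log-utility and the on-line banker rule quoted earlier (the helper name `website_offer` is illustrative and the rule is only defined here for two or three prizes). It converts each Q-value back to a monetary equivalent with the inverse utility $\exp(\cdot)$; small differences in the last digit relative to the text come from rounding of intermediate values.

```python
import math


def website_offer(prizes):
    """On-line banker rule from the text: weights depend on prizes left (2 or 3)."""
    weight_on_big = {3: 0.305, 2: 0.355}[len(prizes)]
    return weight_on_big * max(prizes) + 0.5 * min(prizes)


state = (750, 500, 25)
next_states = [(750, 500), (750, 25), (500, 25)]

# Best Q-value in each two-prize state: accept the offer or take the 50/50 gamble.
best_next = []
for s in next_states:
    q_deal_next = math.log(website_offer(s))
    q_no_deal_next = sum(math.log(v) for v in s) / 2
    best_next.append(max(q_deal_next, q_no_deal_next))

q_no_deal_now = sum(best_next) / len(best_next)  # continuation value at time t
q_deal_now = math.log(website_offer(state))      # utility of the current offer

# Monetary equivalents: invert the log-utility.
money_no_deal = math.exp(q_no_deal_now)
money_deal = math.exp(q_deal_now)
premium = money_no_deal / money_deal - 1

print(f"No Deal is worth about {money_no_deal:.0f}, Deal is worth {money_deal:.2f}")
print(f"Premium for continuing: {premium:.0%}")
```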

To extend this analysis to include higher levels of risk aversion we can consider power
utility:

$$u(x) = \frac{x^{1-\gamma} - 1}{1 - \gamma}$$

where $\gamma = -x\,u''(x)/u'(x)$ is a local measure of risk aversion, first introduced by de Finetti
(1952). We note that as $\gamma$ tends to 1, the utility function above tends toward the
aforementioned logarithmic utility function. Therefore, the logarithmic utility function is just a
special case of the more general power utility function.

With the same three prize values remaining, $s = \{750, 500, 25\}$, and the expected
value criterion for the banker's function, we get utilities:

$$u(B(s = \{750, 500, 25\})) = \frac{(1275/3)^{1-\gamma} - 1}{1 - \gamma}$$

Therefore, the utility of 'Deal' is $Q_t(s, a = 0) = \frac{(1275/3)^{1-\gamma} - 1}{1 - \gamma}$.

As before, we consider the continuation problem ('No Deal') of the set $S^* = \{s_1^*, s_2^*, s_3^*\}$
with expected value banker offers leading to the values 625, 387.5, and 262.5, respectively.
Performing the Q-value calculation gives:

$$\begin{aligned}
Q_t(s, a = 1) &= \sum_{s^* \in S^*} P(s^* \mid s, a = 1)\, \max_{a} Q_{t+1}(s^*, a) \\
&= \frac{1}{3}\left(\frac{625^{1-\gamma} - 1}{1-\gamma} + \frac{387.5^{1-\gamma} - 1}{1-\gamma} + \frac{262.5^{1-\gamma} - 1}{1-\gamma}\right) \\
&= \frac{625^{1-\gamma} + 387.5^{1-\gamma} + 262.5^{1-\gamma} - 3}{3(1-\gamma)}
\end{aligned}$$

with immediate utility $u(s, a) = 0$.

Hence, as the inequality $Q_t(s, a = 1) < Q_t(s, a = 0)$ holds for all levels of risk aversion
$0 < \gamma < \infty$, the optimal action is $a^* = 0$, 'Deal'. Once again, we can see that the
continuation value was not large enough to overcome the generous (expected value) offer
by the banker.
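The statement that 'Deal' is optimal for every $\gamma > 0$ under the expected-value banker can be checked with Jensen's inequality. The short derivation below is a sketch that uses only the numbers of this example. It relies on two facts: the three next-round offers average to the current offer, and with expected-value offers a risk-averse player accepts each two-prize offer, so the best next-round Q-value is the utility of the offer.

```latex
% The next-round expected-value offers average to the current offer:
\frac{625 + 387.5 + 262.5}{3} = \frac{1275}{3} = 425 = B(s).

% u is strictly concave for \gamma > 0, so by Jensen's inequality:
Q_t(s, a = 1)
  = \frac{1}{3}\bigl(u(625) + u(387.5) + u(262.5)\bigr)
  < u\!\left(\frac{625 + 387.5 + 262.5}{3}\right)
  = u(425)
  = Q_t(s, a = 0).
```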
2.5 Sensitivity Analysis - Different Banker's Function

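Now, using the Banker's bidding function from the website, we once again have:

$$B(s) = 0.355 \cdot \text{big} + 0.5 \cdot \text{small} \quad \text{(with two prizes remaining)}$$

From the previous section, we found that $B(s_1^*) = 516.25$, $B(s_2^*) = 278.75$, and $B(s_3^*) = 190$.

We now consider the optimal action with 2 prize values left for the player. To do so,
we calculate the following Q-values:

$$\begin{aligned}
Q_{t+1}(s_1^*, a = 1) &= \frac{1}{2}\left(\frac{750^{1-\gamma} - 1}{1-\gamma} + \frac{500^{1-\gamma} - 1}{1-\gamma}\right) = \frac{750^{1-\gamma} + 500^{1-\gamma} - 2}{2(1-\gamma)} \\
Q_{t+1}(s_1^*, a = 0) &= \frac{516.25^{1-\gamma} - 1}{1-\gamma}
\end{aligned}$$

Since $Q_{t+1}(s_1^*, a = 1) > Q_{t+1}(s_1^*, a = 0)$ for all levels of risk aversion $0 < \gamma < \infty$, the future
optimal policy is 'No Deal' under $s_1^*$.

$$\begin{aligned}
Q_{t+1}(s_2^*, a = 1) &= \frac{1}{2}\left(\frac{750^{1-\gamma} - 1}{1-\gamma} + \frac{25^{1-\gamma} - 1}{1-\gamma}\right) = \frac{750^{1-\gamma} + 25^{1-\gamma} - 2}{2(1-\gamma)} \\
Q_{t+1}(s_2^*, a = 0) &= \frac{278.75^{1-\gamma} - 1}{1-\gamma}
\end{aligned}$$

Here, for $0 < \gamma < 0.5602$, we find that $Q_{t+1}(s_2^*, a = 1) > Q_{t+1}(s_2^*, a = 0)$ (so the optimal
future policy is 'No Deal' under $s_2^*$). On the other hand, for $0.5602 < \gamma < \infty$, we find that
$Q_{t+1}(s_2^*, a = 1) < Q_{t+1}(s_2^*, a = 0)$ (so the future optimal policy is 'Deal' under $s_2^*$).

$$\begin{aligned}
Q_{t+1}(s_3^*, a = 1) &= \frac{1}{2}\left(\frac{500^{1-\gamma} - 1}{1-\gamma} + \frac{25^{1-\gamma} - 1}{1-\gamma}\right) = \frac{500^{1-\gamma} + 25^{1-\gamma} - 2}{2(1-\gamma)} \\
Q_{t+1}(s_3^*, a = 0) &= \frac{190^{1-\gamma} - 1}{1-\gamma}
\end{aligned}$$

Here, for $0 < \gamma < 0.5175$, we find that $Q_{t+1}(s_3^*, a = 1) > Q_{t+1}(s_3^*, a = 0)$ (so the optimal
future policy is 'No Deal' under $s_3^*$). On the other hand, for $0.5175 < \gamma < \infty$, we find that
$Q_{t+1}(s_3^*, a = 1) < Q_{t+1}(s_3^*, a = 0)$ (so the future optimal policy is 'Deal' under $s_3^*$).
Solving for the Q-values at the previous step gives

$$Q_t(s, a = 1) = \sum_{s^* \in S^*} P(s^* \mid s, a = 1)\, \max_{a} Q_{t+1}(s^*, a)$$

$$= \begin{cases}
\dfrac{1}{3}\left(\dfrac{750^{1-\gamma} + 500^{1-\gamma} - 2}{2(1-\gamma)} + \dfrac{750^{1-\gamma} + 25^{1-\gamma} - 2}{2(1-\gamma)} + \dfrac{500^{1-\gamma} + 25^{1-\gamma} - 2}{2(1-\gamma)}\right) & \text{if } 0 < \gamma < 0.5175 \\[2ex]
\dfrac{1}{3}\left(\dfrac{750^{1-\gamma} + 500^{1-\gamma} - 2}{2(1-\gamma)} + \dfrac{750^{1-\gamma} + 25^{1-\gamma} - 2}{2(1-\gamma)} + \dfrac{190^{1-\gamma} - 1}{1-\gamma}\right) & \text{if } 0.5175 < \gamma < 0.5602 \\[2ex]
\dfrac{1}{3}\left(\dfrac{750^{1-\gamma} + 500^{1-\gamma} - 2}{2(1-\gamma)} + \dfrac{278.75^{1-\gamma} - 1}{1-\gamma} + \dfrac{190^{1-\gamma} - 1}{1-\gamma}\right) & \text{if } 0.5602 < \gamma < \infty
\end{cases}$$

From this, we can calculate the following monetary equivalents:

$$\left((1-\gamma)\, Q_t(s, a = 1) + 1\right)^{\frac{1}{1-\gamma}} = \begin{cases}
\left(\dfrac{750^{1-\gamma} + 500^{1-\gamma} + 25^{1-\gamma}}{3}\right)^{\frac{1}{1-\gamma}} & \text{if } 0 < \gamma < 0.5175 \\[2ex]
\left(\dfrac{2\,(750^{1-\gamma} + 190^{1-\gamma}) + 500^{1-\gamma} + 25^{1-\gamma}}{6}\right)^{\frac{1}{1-\gamma}} & \text{if } 0.5175 < \gamma < 0.5602 \\[2ex]
\left(\dfrac{2\,(278.75^{1-\gamma} + 190^{1-\gamma}) + 750^{1-\gamma} + 500^{1-\gamma}}{6}\right)^{\frac{1}{1-\gamma}} & \text{if } 0.5602 < \gamma < \infty
\end{cases}$$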



These are the continuation values (or 'No Deal' decision values) at the current time period.
We can compare these with the 'Deal' decision value:

$$\begin{aligned}
Q_t(s, a = 0) &= u(B(s = \{750, 500, 25\})) \\
&= \frac{(0.305 \cdot 750 + 0.5 \cdot 25)^{1-\gamma} - 1}{1-\gamma} \\
&= \frac{(241.25)^{1-\gamma} - 1}{1-\gamma}
\end{aligned}$$

Hence, the Banker offers the contestant a monetary value of 241.25. Now, comparing the
Q-values of the decisions:

• $Q_t(s, a = 1) > Q_t(s, a = 0)$ if $0 < \gamma < 4.5963$ with optimal action $a^* = 1$, 'No Deal'.

• $Q_t(s, a = 0) > Q_t(s, a = 1)$ if $4.5963 < \gamma < \infty$ with optimal action $a^* = 0$, 'Deal'.

Therefore, we can see that for most risk aversion levels (except for really high values),
the optimal action will be to choose 'No Deal'. This is because the continuation value is
high relative to the value of the banker's offer. Once again, for common risk aversion levels,
we can approximately reduce the Q-learning results to a simple rule of thumb. That is,
continue as long as there are two large prizes left.

3 Analyzing Risky Choices: Suzanne and Frank

A number of authors have argued that backwards induction appears most relevant in the
early rounds of the game and that there is no difference with the myopic rule as the Banker's
offers (near expected value) lead risk averse players to proceed. Our approach has shown
that there is a significant continuation value at the end of the game. Take, for example,
a set of three prizes containing two large ones. Risk averse people will naively choose the
action 'Deal' when, had they incorporated the continuation value, they would choose 'No
Deal'. Of course, this is sensitive to the banker's offer function. If risk aversion is very
high as manifested by the contestant's utility function, then the continuation value may
not always be high.
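How high risk aversion must be before the continuation value stops dominating can be located numerically. The sketch below revisits the prizes {750, 500, 25} with the on-line banker rule and searches by bisection for the level of $\gamma$ at which 'Deal' and 'No Deal' are equally attractive. It is an illustration under stated assumptions: the function names are hypothetical, and the backward induction is carried out on certainty equivalents rather than utilities, which gives the same decisions because CRRA utility is increasing, while avoiding round-off for large $\gamma$.

```python
import math
from itertools import combinations


def website_offer(prizes):
    """On-line banker rule quoted in the text (defined for 2 or 3 prizes only)."""
    weight_on_big = {3: 0.305, 2: 0.355}[len(prizes)]
    return weight_on_big * max(prizes) + 0.5 * min(prizes)


def certainty_equivalent(amounts, gamma):
    """CRRA certainty equivalent of a set of equally likely amounts."""
    if abs(gamma - 1.0) < 1e-12:  # log-utility limit
        return math.exp(sum(math.log(x) for x in amounts) / len(amounts))
    p = 1.0 - gamma
    return (sum(x ** p for x in amounts) / len(amounts)) ** (1.0 / p)


def no_deal_value(prizes, gamma):
    """Monetary value of 'No Deal', found by backward induction."""
    if len(prizes) == 2:
        return certainty_equivalent(prizes, gamma)
    next_values = [
        max(website_offer(nxt), no_deal_value(nxt, gamma))
        for nxt in combinations(prizes, len(prizes) - 1)
    ]
    return certainty_equivalent(next_values, gamma)


def critical_gamma(prizes, lo=1.5, hi=10.0, iterations=60):
    """Bisection for the gamma at which 'No Deal' and 'Deal' are worth the same."""
    def gap(gamma):
        return no_deal_value(prizes, gamma) - website_offer(prizes)

    assert gap(lo) > 0 > gap(hi), "bracket must contain a sign change"
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


gamma_star = critical_gamma((750, 500, 25))
print(round(gamma_star, 4))  # should be close to the 4.5963 reported in Section 2.5
```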

Post et al (2008) propose that path-dependence factors heavily into the choices of
contestants, and in fact the choices can be explained by varying levels of risk aversion as
the game progresses favorably or unfavorably for the contestant. We find that this does
not necessarily have to be the case, since choosing to turn down a Banker's offer does not
have to imply decreasing risk aversion, but only a higher continuation value, which becomes
apparent once the myopia assumption on the contestants is removed.

As well, as the game progresses, we generally see that we can place increasingly restrictive
upper bounds on the risk aversion coefficient $\gamma$. This can be done whenever we observe
a contestant choose the action 'No Deal'. However, since we cannot place a lower bound
on this parameter $\gamma$ until we observe a contestant choose the action 'Deal', we cannot
necessarily infer that the risk aversion level of a player is decreasing. This is because at the
point that the action 'Deal' is taken, the game immediately ends. Therefore, we cannot
observe a future action choice which requires a lower $\gamma$ parameter.
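The bookkeeping behind this argument can be sketched in a few lines. The example below is illustrative only: `update_bounds` is a hypothetical helper, the thresholds in the demonstration are made-up numbers rather than values from the data, and it assumes that in each round 'No Deal' is optimal exactly when $\gamma$ lies below that round's indifference threshold.

```python
import math


def update_bounds(bounds, action, threshold):
    """Tighten the (lower, upper) interval for gamma after one observed choice.

    `threshold` is the gamma at which the contestant would be indifferent in
    that round; 'No Deal' is assumed to be optimal below it.
    """
    lower, upper = bounds
    if action == "No Deal":
        upper = min(upper, threshold)  # a rejection caps gamma from above
    elif action == "Deal":
        lower = max(lower, threshold)  # an acceptance bounds gamma from below
    else:
        raise ValueError(f"unknown action: {action}")
    return lower, upper


# Hypothetical sequence of rejections with made-up thresholds.
bounds = (-math.inf, math.inf)
for action, threshold in [("No Deal", 2.5), ("No Deal", 1.2), ("No Deal", 1.8)]:
    bounds = update_bounds(bounds, action, threshold)

print(bounds)  # (-inf, 1.2)
```

Only 'No Deal' choices are ever observed before the game ends, so the lower bound stays at its initial value throughout; a tighter upper bound therefore says nothing about whether risk aversion is falling.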

We now use the examples of the real German contestant Susanne, as well as the Dutch
contestant Frank (both from Post et al, 2008), to illustrate our concept.

3.1 Example - Suzanne from Germany 

We start by analyzing Susanne's four remaining prizes $\{0.5,\ 1{,}000,\ 100{,}000,\ 150{,}000\}$ in
round 7. We let $t = 7$ and work backwards from the final stage, time $t + 2 = 9$, where two
prizes remain. Hence, we evaluate all possible outcomes from the current state at time $t$ in
order to determine Q-values for this current state.

First, we consider the expected banker's offers for all of the possible outcomes at $t + 2$.
Since for Susanne, we can only observe the Banker's offer for a single outcome, we use
the observed percent of expected value of the given offer as a multiplier on the expected
value of all other potential outcomes. For time $t + 2$, this is $m_{t+2} = 100\%$. Using this, we
calculate the banker's function for all other possible outcomes:

$$\begin{aligned}
B(s^*_{t+2,1} = \{0.5,\ 1{,}000\}) &= m_{t+2} \cdot \bar{s}^*_{t+2,1} = 500.25 \\
B(s^*_{t+2,2} = \{0.5,\ 100{,}000\}) &= m_{t+2} \cdot \bar{s}^*_{t+2,2} = 50{,}000.25 \\
B(s^*_{t+2,3} = \{0.5,\ 150{,}000\}) &= m_{t+2} \cdot \bar{s}^*_{t+2,3} = 75{,}000.25 \\
B(s^*_{t+2,4} = \{1{,}000,\ 100{,}000\}) &= m_{t+2} \cdot \bar{s}^*_{t+2,4} = 50{,}500 \\
B(s^*_{t+2,5} = \{1{,}000,\ 150{,}000\}) &= m_{t+2} \cdot \bar{s}^*_{t+2,5} = 75{,}500 \\
B(s^*_{t+2,6} = \{100{,}000,\ 150{,}000\}) &= m_{t+2} \cdot \bar{s}^*_{t+2,6} = 125{,}000
\end{aligned}$$

Therefore, we get the Q-values for the decisions $a_{t+2,i} = 0$, $i \in \{1, \ldots, 6\}$, namely
$Q_{t+2}(s^*_{t+2,i}, a_{t+2,i} = 0)$ for $i \in \{1, \ldots, 6\}$, by plugging these offers into the power utility function.

Similarly, we can work out the Q-values for the decisions $a_{t+2,i} = 1$, $i \in \{1, \ldots, 6\}$. For
simplicity of notation we define $Q_t(i, j) = Q_t(s^*_{t,i}, a_{t,i} = j)$.

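$$\begin{aligned}
Q_{t+2}(1, 1) &= \frac{1}{2}\left(\frac{0.5^{1-\gamma} - 1}{1-\gamma} + \frac{1{,}000^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+2}(2, 1) &= \frac{1}{2}\left(\frac{0.5^{1-\gamma} - 1}{1-\gamma} + \frac{100{,}000^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+2}(3, 1) &= \frac{1}{2}\left(\frac{0.5^{1-\gamma} - 1}{1-\gamma} + \frac{150{,}000^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+2}(4, 1) &= \frac{1}{2}\left(\frac{1{,}000^{1-\gamma} - 1}{1-\gamma} + \frac{100{,}000^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+2}(5, 1) &= \frac{1}{2}\left(\frac{1{,}000^{1-\gamma} - 1}{1-\gamma} + \frac{150{,}000^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+2}(6, 1) &= \frac{1}{2}\left(\frac{100{,}000^{1-\gamma} - 1}{1-\gamma} + \frac{150{,}000^{1-\gamma} - 1}{1-\gamma}\right)
\end{aligned}$$

Comparing these Q-values at $t + 2$, we can determine the optimal actions at each of the
possible states. That is, choose $a_{t+2,i} = 0$, $i \in \{1, \ldots, 6\}$, since

$$Q_{t+2}(i, 0) > Q_{t+2}(i, 1)$$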

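Now, we consider the expected banker's offers at each of the possible states at time $t + 1$.
Here we use an expected value multiplier of $m_{t+1} = 90\%$:

$$\begin{aligned}
B(s^*_{t+1,1} = \{0.5,\ 1{,}000,\ 100{,}000\}) &= m_{t+1} \cdot \bar{s}^*_{t+1,1} = 30{,}300.2 \\
B(s^*_{t+1,2} = \{0.5,\ 1{,}000,\ 150{,}000\}) &= m_{t+1} \cdot \bar{s}^*_{t+1,2} = 45{,}300.2 \\
B(s^*_{t+1,3} = \{0.5,\ 100{,}000,\ 150{,}000\}) &= m_{t+1} \cdot \bar{s}^*_{t+1,3} = 75{,}000.2 \\
B(s^*_{t+1,4} = \{1{,}000,\ 100{,}000,\ 150{,}000\}) &= m_{t+1} \cdot \bar{s}^*_{t+1,4} = 75{,}300
\end{aligned}$$

Using these, we get the following Q-values for the decisions $a_{t+1,i} = 0$ for $i \in \{1, \ldots, 4\}$:

$$\begin{aligned}
Q_{t+1}(1, 0) &= \frac{30{,}300.2^{1-\gamma} - 1}{1-\gamma}, & Q_{t+1}(2, 0) &= \frac{45{,}300.2^{1-\gamma} - 1}{1-\gamma} \\
Q_{t+1}(3, 0) &= \frac{75{,}000.2^{1-\gamma} - 1}{1-\gamma}, & Q_{t+1}(4, 0) &= \frac{75{,}300^{1-\gamma} - 1}{1-\gamma}
\end{aligned}$$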

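Using the optimal actions at $t + 2$, we can calculate the continuation Q-values at $t + 1$, that
is where $a_{t+1,i} = 1$, $i \in \{1, \ldots, 4\}$:

$$\begin{aligned}
Q_{t+1}(1, 1) &= \frac{1}{3}\left(\frac{500.25^{1-\gamma} - 1}{1-\gamma} + \frac{50{,}000.25^{1-\gamma} - 1}{1-\gamma} + \frac{50{,}500^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+1}(2, 1) &= \frac{1}{3}\left(\frac{500.25^{1-\gamma} - 1}{1-\gamma} + \frac{75{,}000.25^{1-\gamma} - 1}{1-\gamma} + \frac{75{,}500^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+1}(3, 1) &= \frac{1}{3}\left(\frac{50{,}000.25^{1-\gamma} - 1}{1-\gamma} + \frac{75{,}000.25^{1-\gamma} - 1}{1-\gamma} + \frac{125{,}000^{1-\gamma} - 1}{1-\gamma}\right) \\
Q_{t+1}(4, 1) &= \frac{1}{3}\left(\frac{50{,}500^{1-\gamma} - 1}{1-\gamma} + \frac{75{,}500^{1-\gamma} - 1}{1-\gamma} + \frac{125{,}000^{1-\gamma} - 1}{1-\gamma}\right)
\end{aligned}$$

Thus, we can see that:

$$\begin{aligned}
Q_{t+1}(1, 0) &< Q_{t+1}(1, 1) \ \text{ if } \gamma < 0.22617, & Q_{t+1}(2, 0) &< Q_{t+1}(2, 1) \ \text{ if } \gamma < 0.22077 \\
Q_{t+1}(3, 0) &< Q_{t+1}(3, 1) \ \text{ if } \gamma < 1.50645, & Q_{t+1}(4, 0) &< Q_{t+1}(4, 1) \ \text{ if } \gamma < 1.54085
\end{aligned}$$

Since we observe that Suzanne, when faced with $s^*_{t+1,4}$, chose action $a_{t+1,4} = 1$, we must
have that her $\gamma < 1.54085$.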
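The four indifference levels above can be recomputed with a short script. This is a sketch under the assumptions stated in the text: the banker's offer is a fixed fraction of the expected value (90% at $t + 1$ and 100% at $t + 2$) and utility is CRRA. The function names are illustrative, and the comparison is made on certainty equivalents, which is equivalent to comparing Q-values because the utility function is increasing.

```python
import math
from itertools import combinations


def certainty_equivalent(amounts, gamma):
    """CRRA certainty equivalent of equally likely monetary amounts."""
    if abs(gamma - 1.0) < 1e-12:  # log-utility limit
        return math.exp(sum(math.log(x) for x in amounts) / len(amounts))
    p = 1.0 - gamma
    return (sum(x ** p for x in amounts) / len(amounts)) ** (1.0 / p)


def offer(prizes, multiplier):
    """Banker's offer modeled as a fixed fraction of the expected value."""
    return multiplier * sum(prizes) / len(prizes)


def indifference_gamma(prizes, m_now, m_next, lo=0.01, hi=5.0):
    """Bisection for the gamma at which 'Deal' and 'No Deal' tie (three prizes left)."""
    offer_now = offer(prizes, m_now)

    def gap(gamma):
        # Value of each two-prize state: best of the final offer and the 50/50 gamble.
        values = [
            max(offer(pair, m_next), certainty_equivalent(pair, gamma))
            for pair in combinations(prizes, 2)
        ]
        return certainty_equivalent(values, gamma) - offer_now

    assert gap(lo) > 0 > gap(hi), "bracket must contain a sign change"
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


prizes_t = (0.5, 1_000, 100_000, 150_000)
# The four three-prize states, in the same order as states 1 to 4 in the text.
thresholds = [
    indifference_gamma(state, m_now=0.90, m_next=1.00)
    for state in combinations(prizes_t, 3)
]
print([round(g, 3) for g in thresholds])
```

The last entry corresponds to the state Suzanne actually faced, so it is the upper bound on her $\gamma$ implied by her 'No Deal' choice at $t + 1$.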

Finally, we can now consider the banker's offer at the current state at time $t$. Here, we
see that the expected value multiplier is $m_t = 73.31\%$:

$$B(s^*_t = \{0.5,\ 1{,}000,\ 100{,}000,\ 150{,}000\}) = m_t \cdot \bar{s}^*_t = 46{,}000$$

Therefore, we get the following Q-value for the decision $a_t = 0$:

$$Q_t(s^*_t, a_t = 0) = \frac{46{,}000^{1-\gamma} - 1}{1-\gamma}$$

Using the optimal actions at $t + 1$, we calculate the continuation Q-values at $t$, with $a_t = 1$,
and determine

$$Q_t(s^*_t, a_t = 1) = \begin{cases}
\frac{1}{4}\left[Q_{t+1}(1, 1) + Q_{t+1}(2, 1) + Q_{t+1}(3, 1) + Q_{t+1}(4, 1)\right] & \text{if } \gamma < 0.22077 \\[1ex]
\frac{1}{4}\left[Q_{t+1}(1, 1) + Q_{t+1}(2, 0) + Q_{t+1}(3, 1) + Q_{t+1}(4, 1)\right] & \text{if } 0.22077 < \gamma < 0.22617 \\[1ex]
\frac{1}{4}\left[Q_{t+1}(1, 0) + Q_{t+1}(2, 0) + Q_{t+1}(3, 1) + Q_{t+1}(4, 1)\right] & \text{if } 0.22617 < \gamma < 1.50645 \\[1ex]
\frac{1}{4}\left[Q_{t+1}(1, 0) + Q_{t+1}(2, 0) + Q_{t+1}(3, 0) + Q_{t+1}(4, 1)\right] & \text{if } 1.50645 < \gamma < 1.54085
\end{cases}$$

where $Q_t(i, j) = Q_t(s^*_{t,i}, a_{t,i} = j)$.

Using our optimal choice of $a_{t+1}$ for a given value of $\gamma$ we have:

• If $\gamma < 0.22077$, then $Q_t(s^*_t, a_t = 0) < Q_t(s^*_t, a_t = 1)$ for $\gamma < 0.68666$.

• If $0.22077 < \gamma < 0.22617$, then $Q_t(s^*_t, a_t = 0) < Q_t(s^*_t, a_t = 1)$ for $\gamma < 0.87506$.

• If $0.22617 < \gamma < 1.50645$, then $Q_t(s^*_t, a_t = 0) < Q_t(s^*_t, a_t = 1)$ for $\gamma < 2.61618$.

• If $1.50645 < \gamma < 1.54085$, then $Q_t(s^*_t, a_t = 0) < Q_t(s^*_t, a_t = 1)$ for $\gamma < 2.73117$.

Therefore, for $\gamma < 1.54085$ the optimal decision at time $t$ is $a_t = 1$, 'No Deal', since the
continuation value is the larger. This is consistent with the choice that Susanne made.

Now, we can plot the evolving monetary values of each action $a$ at each point in time.
In Figure 1 the solid line represents the value of the Banker's offer ($a = 0$), and the four
dotted lines represent the continuation value ($a = 1$) for a range of ascending values of the
risk aversion parameter $\gamma$. At each point in time, the contestant will choose the action with
the larger value. Therefore, we see that Susanne will prefer action $a_{t+1} = 1$ at time $t + 1$
if her risk aversion parameter is $\gamma < 1.54085$. However, at time $t + 2$, all values of $\gamma > 0$
will induce the optimal action $a_{t+2} = 0$ since the value of the Banker's offer is larger than
all possible continuation values. Naturally, at time $t + 3$, where the contestant holds only
one case, there is complete certainty about the prize value, therefore (for completeness) the
Banker's offer has converged to the value of that prize.

Also, it is important to note that although we have only considered the final stages 
of Susanne's game, it is unlikely that an earlier stage has placed a more restrictive upper
bound on the risk aversion parameter. Since the Banker's offer as a percentage of expected
value is increasing over time, choosing the action 'Deal' early on in the game becomes very 
unattractive. Only a contestant with a very large risk aversion level will choose 'Deal' 
early and forgo the almost certain increasing percentage of expected value implicit in the 
Banker's offer function. 



[Plot: Banker's Offer vs. Continuation Value - Susanne. Horizontal axis: Time (t, t+1, t+2, t+3); vertical axis ranging from 40,000 to 160,000.]

Figure 1: Susanne's evolving values of the two actions for varying levels of risk aversion

Thus, we can see that given a constant risk aversion level of $\gamma < 1.54085$, it is possible
to rationalize the observed actions of the real player, Susanne, except for the final time period
$t + 2$. Seemingly, only a completely risk-neutral or even risk-seeking player would be willing
to turn down an expected value offer in exchange for a fair gamble on a set of prizes, as per
Jensen's inequality. However, we have observed Susanne turning down an offer of 125,000
for a 50/50 gamble on the set of prizes {100,000, 150,000}. What could cause her to do so
while allowing us to not deviate from the constant risk aversion level hypothesis?

3.2 Example - Frank from the Netherlands

Consider Frank's four remaining prizes {0.50, 10, 20, 10,000} in round $t = 7$. As he plays
through his remaining rounds, Frank is presented with the following remaining prizes at
each period:

$$s_t = \{0.5,\ 10,\ 20,\ 10{,}000\}, \quad s_{t+1} = \{10,\ 20,\ 10{,}000\}, \quad s_{t+2} = \{10,\ 10{,}000\}, \quad s_{t+3} = \{10\}$$

At each one of these periods, we observe the following Banker's offers and implied $m$ values:

$$\begin{aligned}
B(s_t) &= 2{,}400 = m_t \cdot \bar{s}_t = m_t \cdot 2507.63 &&\Rightarrow\ m_t = 95.71\% \\
B(s_{t+1}) &= 3{,}500 = m_{t+1} \cdot \bar{s}_{t+1} = m_{t+1} \cdot 3343.33 &&\Rightarrow\ m_{t+1} = 104.69\% \\
B(s_{t+2}) &= 6{,}000 = m_{t+2} \cdot \bar{s}_{t+2} = m_{t+2} \cdot 5005.00 &&\Rightarrow\ m_{t+2} = 119.88\%
\end{aligned}$$
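The implied multipliers are simply the observed offer divided by the mean of the remaining prizes. A minimal sketch of that arithmetic, using the prizes and offers listed above (the variable names are illustrative):

```python
# Remaining prizes and the banker's observed offer in rounds t, t+1, t+2.
rounds = [
    ((0.5, 10, 20, 10_000), 2_400),
    ((10, 20, 10_000), 3_500),
    ((10, 10_000), 6_000),
]

multipliers = []
for prizes, banker_offer in rounds:
    expected_value = sum(prizes) / len(prizes)
    multipliers.append(banker_offer / expected_value)

print([f"{m:.2%}" for m in multipliers])
# ['95.71%', '104.69%', '119.88%']
```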

Interestingly, the Banker's offer percentage is very high with respect to expected value. In
fact, the final two offers are significantly above expected value.

At time period $t + 2$ we can see that Frank turns down the offer of 6,000 for the 50/50
gamble between 10 and 10,000. Short of Frank being risk averse and acting completely
irrationally, there are two possible explanations for this observed action. Either he truly
has a risk aversion level characterized by $\gamma < 0$ (i.e. he is risk seeking), or he has a large
"enjoyment" benefit from playing the game (i.e. his enjoyment benefit $b$, introduced in
Section 3.3, is quite large). It is possible that if he has a large enough value of $b$, his
choice of 'No Deal' at time $t + 2$ is not risk seeking, but actually risk averse!

As an additional point, looking at his decision at time $t + 1$, we can see that even though
Frank turns down a Banker offer larger than the expected value, this choice may not have
been risk seeking even if he has an "enjoyment" benefit of $b = 0$! To see this, consider
the continuation value: if Frank expects that the Banker's expected value offer percentage
$m$ is increasing in each round, he may get an even larger Banker's expected value offer
percentage $m$ in the next round. Therefore, even if the $m$ value in one round is larger than
100%, an observed action of 'No Deal' does not preclude risk aversion if an even larger $m$
value is expected in the next period. Plainly, this is caused by a large continuation value.
As we did for Susanne, Figure 2 shows the evolving monetary values of each
action $a$ at every point in time.

[Plot: Banker's Offer vs. Continuation Value - Frank. Horizontal axis: Time; vertical axis ticks from 1,000 to 6,000.]

Figure 2: Frank's evolving values of the two actions for varying levels of risk aversion



3.3 The terminal risk-seeking choice 

One interesting feature of Suzanne's and Frank's choices is their risk-seeking behaviour
at the terminal decision. Clearly, there is no continuation value left in the game. How
irrational are these decisions? The contestants may be gaining enjoyment from other sources,
such as enthusiasm from playing the game, audience encouragement, and even excitement from
being on TV. We assume that the marginal enjoyment benefit is positive at each further
stage of the game, since a player would likely prefer to play the game longer rather than
shorter. If we let the marginal benefit to Susanne of playing the final round of the game be
the value $b$, we then have the following condition to turn down the Banker's final offer:

$$\frac{125{,}000^{1-\gamma} - 1}{1-\gamma} \le \frac{1}{2}\left(\frac{(100{,}000 + b)^{1-\gamma} - 1}{1-\gamma} + \frac{(150{,}000 + b)^{1-\gamma} - 1}{1-\gamma}\right)$$

$$\gamma = 1.54085 \ \Longrightarrow\ b \ge 3{,}761.90$$

Therefore, we can see that even at the highest possible risk aversion level of $\gamma = 1.54085$,
Susanne only needs an "enjoyment" benefit of €3,761.90 in order to justify her 'No Deal'
choice.
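The required enjoyment benefit can be recovered by solving the condition above with equality. The following is a sketch, assuming the CRRA utility of Section 2 and a simple bisection search; the names `crra` and `required_enjoyment` are illustrative.

```python
def crra(x, gamma):
    """CRRA power utility (gamma != 1), as defined in Section 2."""
    return (x ** (1.0 - gamma) - 1.0) / (1.0 - gamma)


def required_enjoyment(offer, prizes, gamma, iterations=80):
    """Smallest benefit b that makes the final gamble as good as the offer."""
    def gap(b):
        gamble = sum(crra(v + b, gamma) for v in prizes) / len(prizes)
        return gamble - crra(offer, gamma)

    # With b = 0 a risk-averse player prefers the offer; adding the whole
    # offer to every prize certainly makes the gamble better.
    lo, hi = 0.0, float(offer)
    assert gap(lo) < 0 < gap(hi), "bracket must contain a sign change"
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


b_star = required_enjoyment(125_000, (100_000, 150_000), gamma=1.54085)
print(f"{b_star:,.2f}")  # should be close to the 3,761.90 reported above
```

Passing a smaller `gamma` gives a smaller required benefit, which is why the value at the upper bound of $\gamma$ is the most demanding case.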
4 Discussion



The TV game show Deal or No Deal is well suited for analyzing risky choices. The stakes 
are high and the outcomes for the contestants can range from a multi-million payday
to empty-handed. The game involves only binary decisions and there are no subjective 
probabilities as the odds are well-defined ahead of time. We show, however, that the 
sequential nature of the game induces a significant continuation value that has been missed 
by a number of researchers. Choices can appear irrational and risky when in fact they are 
just an optimal strategy with reasonable (constant) parameters for risk aversion in a simple 
CRRA model of utility and choice. 

Our approach uses Q-learning, a popular technique in the reinforcement learning 
literature, to solve for the optimal strategy. We then show that choices provide bounds on 
risk aversion parameters conditional on optimal play. We analyze the Frank and Suzanne 
data-sets from Post et al. (2008), who test the Thaler and Johnson (1990) hypothesis that 
many choices are affected by past outcomes. While Post et al. (2008) conclude that many 
of the choices are the effect of time-varying risk aversion, our empirical findings are more 
in line with Bombardini and Trebbi (2007), who propose a dynamic expected utility model 
for the Italian version of the game and conclude that contestants have constant levels of 
risk aversion but players are extremely heterogeneous in their beliefs. 
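
As an illustration of how a single choice bounds risk aversion, the sketch below treats Susanne's round 8
(Table 1), where three prizes remain and the Banker offers €75,300. It assumes the CRRA form
U(x) = (x^(1-γ) - 1)/(1-γ), a round-9 offer equal to 100% of the remaining average (as in Table 1), and
uniformly random elimination of one prize, and then bisects for the γ at which 'Deal' and 'No Deal' are
equally attractive. The function names and the search bracket are illustrative, not taken from the paper.

```python
import math

def crra(x, gamma):
    """CRRA utility (x^(1-gamma) - 1) / (1 - gamma); log utility at gamma = 1."""
    if abs(gamma - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - gamma) - 1.0) / (1.0 - gamma)

def q_no_deal(prizes, next_m, gamma):
    """Expected utility of 'No Deal' when exactly one more offer follows."""
    total = 0.0
    for i in range(len(prizes)):
        left = prizes[:i] + prizes[i + 1:]          # one prize eliminated at random
        next_offer = next_m * sum(left) / len(left)  # assumed banker rule
        gamble = sum(crra(p, gamma) for p in left) / len(left)
        total += max(crra(next_offer, gamma), gamble)
    return total / len(prizes)

def indifference_gamma(prizes, offer, next_m, lo=0.0, hi=3.0, tol=1e-9):
    """Bisect for the gamma at which Q(No Deal) equals U(offer).

    Assumes 'No Deal' is preferred at `lo` and 'Deal' is preferred at `hi`.
    """
    def gap(g):
        return q_no_deal(prizes, next_m, g) - crra(offer, g)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:   # still prefers 'No Deal': the bound lies higher
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma_max = indifference_gamma([1000.0, 100000.0, 150000.0], 75300.0, next_m=1.0)
print(f"'No Deal' in round 8 is consistent with gamma below about {gamma_max:.3f}")
```

Under these assumptions the indifference point comes out near γ ≈ 1.54, in line with the upper level of
1.54085 quoted in Section 3.3. Repeating the calculation for each round would give one bound per
observed decision.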

We note that the idea of time-varying risk aversion has been captured in the economics 
and finance literature in a number of ways. A common approach is via habit formation, 
as shown by Campbell and Cochrane (1999) and Brandt and Wang (2011). Nevertheless, 
this seems unrealistic in such a short-lived game. Other interesting utility specifications 
to which Q-learning can easily be applied are prospect theory, which predicts possible time 
inconsistencies in gambling attitudes as discussed by Barberis (2011), and recursive utility 
methods as proposed in Kreps and Porteus (1978). We emphasize here that the neo-classical 
constant relative risk aversion model provides a realistic model for the contestants' attitudes 
studied here, except for their final risk-seeking choices. 

Table 1: Susanne's Choices (X marks a prize still in play in that round) 

Prize               1        2        3        4        5        6        7        8        9
€0.01               X        X        X        X
€0.20               X        X        X
€0.50               X        X        X        X        X        X        X
€1
€5
€10
€20                 X        X
€50                 X        X
€100                X        X        X        X
€200
€300                X        X        X
€400                X
€500
€1,000              X        X        X        X        X        X        X        X
€2,500              X        X        X        X        X        X
€5,000              X
€7,500
€10,000             X        X
€12,500             X        X        X
€15,000             X
€20,000             X        X
€25,000             X        X        X        X        X
€50,000             X
€100,000            X        X        X        X        X        X        X        X        X
€150,000            X        X        X        X        X        X        X        X        X
€250,000            X
Average €      32,094   21,431   26,491   34,825   46,417   50,700   62,750   83,667  125,000
Offer €         3,400    4,350   10,000   15,600   25,000   31,400   46,000   75,300  125,000
Offer %           11%      20%      38%      45%      54%      62%      73%      90%     100%
Decision      No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal
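
The summary rows of Table 1 can be recomputed from the X pattern. The sketch below encodes, for each
prize that survives the first round, the number of rounds it stays in play (read off the table), and
derives the average remaining prize and the offer percentage for each round. The dictionary layout and
variable names are our own.

```python
# Prize (EUR) -> number of rounds (1..9) in which it is still in play, read from Table 1.
rounds_in_play = {
    0.01: 4, 0.20: 3, 0.50: 7, 20: 2, 50: 2, 100: 4, 300: 3, 400: 1,
    1000: 8, 2500: 6, 5000: 1, 10000: 2, 12500: 3, 15000: 1, 20000: 2,
    25000: 5, 50000: 1, 100000: 9, 150000: 9, 250000: 1,
}
# Banker's offers in rounds 1..9, from the "Offer" row of Table 1.
offers = [3400, 4350, 10000, 15600, 25000, 31400, 46000, 75300, 125000]

averages = []
for rnd in range(1, 10):
    alive = [p for p, last in rounds_in_play.items() if last >= rnd]
    averages.append(sum(alive) / len(alive))

# Offer as a percentage of the average remaining prize, rounded to whole percent.
percentages = [round(100 * o / a) for o, a in zip(offers, averages)]

for rnd, (avg, pct) in enumerate(zip(averages, percentages), start=1):
    print(f"round {rnd}: average {avg:>10,.0f}  offer % {pct:>3d}")
```

The recomputed averages and percentages agree with the "Average" and "Offer %" rows of the table, which
also confirms the reading of the X marks as prizes still in play.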

Table 2: Frank's Choices (X marks a prize still in play in that round) 

Prize               1        2        3        4        5        6        7        8        9
€0.01               X        X
€0.20               X        X
€0.50               X        X        X        X        X        X        X
€1                  X        X        X        X        X
€5
€10                 X        X        X        X        X        X        X        X        X
€20                 X        X        X        X        X        X        X        X
€50
€100
€500
€1,000              X
€2,500              X        X        X
€5,000              X        X
€7,500
€10,000             X        X        X        X        X        X        X        X        X
€25,000             X        X
€50,000             X        X        X        X
€75,000             X        X        X
€100,000            X        X        X
€200,000            X        X        X        X
€300,000            X
€400,000            X
€500,000            X        X        X        X        X        X
€1,000,000          X
€2,500,000
€5,000,000          X
Average €     383,427   64,502   85,230   95,004   85,005  102,006    2,508    3,343    5,005
Offer €        17,000    8,000   23,000   44,000   52,000   75,000    2,400    3,500    6,000
Offer %            4%      12%      27%      46%      61%      74%      96%     105%     120%
Decision      No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal  No Deal